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1 .    Introduction. 

Modern  microelectronic  technology   is  likely  to  make 
possible  the  design  and  production  of   a  fast  floating  point 
processor  on  a  chip.   It  is  interesting  to  contemplate  the 
level   of   computing  power  that  could  be  attained  if 
thousands  of  such  processors  could  be  organized  to  work 
together  effectively  for  the  solution  of  major  scientific 
problems. 

The  NYU  "Ultracomputer"  group  has  been  engaged  in  a 
study  to  see   ho'-?   such   ense-rbles    can  be  constructed 
and  used  effectively.   A  tentative  design  has  evolved  along 
with  novel  hardware  and  software  features  that 
facilitate  the  preparation  of  scientific  programs  and 
their  transformation  to  parallel  form.   It  is  our  purpose  here 
to  view  our  work  concerning  scientific   programming   on 
such  a  machine.   It  will  be  necessary  first  to  give  a 
very  brief  outline  of  the  hardware,  and  also  of  a  few 
relevant  software  issues. 


The  machine  under  study  at  NYU  consists  of  a  large 
number  of  VLSI  processing  elements  ("PEs")  --  in  the 
nominal  design  there  are  4096  --  connected  by  a  switching 
network  to  a  large  number  of  memory  modules  ("MMs") . 
Our   design   is   an   example  of  an   MIMD   machine  —  one 
in  which  multiple  instruction  streams  act  upon  multiple 
sets  of  data.   It  is  convenient,  although  not  necessary, 
to  use  roughly  as  many  memory  modules  as  processors  but 
there  is  no  special  relation  between  specific  (PEs) 
and  specific  MMs.   One  should  rather  think  of  the  ensemble 
of  memory  as  constituting     a  global  memory  which  can 
be  accessed  publicly  by  all  PEs.   It  is  also  likely 
that  each  PE  will  have  some  local  memory,  probably 
organized  as  a  cache.   The  switching  network  will  be  of 
a  type  (see  Figure  1)  that  permits  any  PF  to  be  connected 
to  any  MM   by  a  unique  path  and  that  requires  a  number  of 
stages  equal  to  the  logarithm  of  the  total  number  of  PEs. 
In  our  design,  each  node  of  the  switching  network  is   also 
a     special  purpose   computing   element   with  a  small 
memory   and  the  capacity  to  do  logical  and  addition  opera- 
tions.  The  connection  network  requirements  are  mild  enough 
for  one  of  the  switches  to  fit  easily  on  a  single  VLSI  chip. 

We  allow  the  entire  processor  array  to  be  used 
in  a  multiprocessing  multitasking  operating  system. 
That  is,  the  PEs  can  be  assigned  either  to  totally 
unrelated   work  for  different  users   or  to  cooperate 
on  one  job  for  a  single  user.  A  simple  example  of  a  program 


easily  organized  to  use  many  PKs   successfully  is  a 
Monte  Carlo  program  where,  under  many  circumstances, 
it  is  possible  to  assign  independent  samples  to 
separate  processors . 

We  expect  that  no  processor  would  act  exclusively 
or  primarily  as  a  system  monitor  or  organizer.  Instead, 
the  operating  system  will  be  logically  distributed, 
with  each  processor  performing  fragmentary  operating  system 
actions.   More  specifically,  when  a  processor  finishes 
an  assigned  task  it  will  interrogate  a  global  queue  of 
tasks  awaiting  attention,  claim  a  task,  and  go  on  to 
perform  this  task. 

It  is  clear  that  communication  among  PEs   is  a 
question  crucial  for  efficient  operation.   This  has 
received  much  attention  in  our  work,  and  promising  solu- 
tions have  emerged.   An  elementary  but  very  effective 
communication-synchronization  primitive   suffices. 
This  is  the  so-called  "replace-add"  instruction.  An 
example  of  its  use  is 

J  =  IRAd,  2*L  +  1) 

which  causes  the  variable  I,  presumably  in  publicly  access- 
ible  memory,  to  be  incremented  by  the  expression  (2  *  L  +  1) 
computed  by  a  PE.   The  new  value  I  +  2*L  +  1  is  immediately 
stored  at  I  and  is  also  returned  as  the  value  of  the  function 
IRA   and  is  therefore  assigned  to  J  which  is  likely  to  be  local 
to  the  calling  PE.  It  is  crucial  to  the  functioning  of  the 
IRA  operation  that  many  PEs   can   simultaneously   address 


IRA  instruction  to  the  same  public  variable:  the 
result  must  then  conform  to  some  possible  serial  order 
of  execution.   Suppose,  for  example,  that  I  is  initially 
zero  and  that  PEO  and  PEl  concurrently  execute  IRA (1,1). 
Then  either  1  will  be  returned  to  PEO  and  2  to  PEl  or 
else  2  to  PEO  and  1  to  PFl .   In  either  case  I  becomes  2. 
In  spite  of  the  partial  indeterminacy   of  the  results  of 
concurrent  IRA  operations,  it  is  not  difficult  to  write 
basic  communication  routines  that  work  correctly  and, 
very  important,  do  not  require  that  PEs  wait  unnecessarily. 

A  simple  example  of  the  use  of  the  IRA  arises  in 
the  construction 

D0  100  1-^1,1000 
100   CALL  SUB(X,Y,Z,I) 
where  the  calls  can  be  executed  concurrently  by  separate 
processors  providing  each   receives  a  unique  value  of  I. 
To  parallelize  this  loop  a  new  public  variable  K  would  be 
defined,  initialized  to  zero,  and  1000  tasks  would  be  created, 
Each  task  would  first  execute  J  =  IRA(K,1)  so  that  each 
processor  obtains  a  unique  index  value  for  execution  of  SUB 
rather  than  using  a  parameter  of  the  call. 

Our  work  has  shown  that  the  IRA  instruction  can  be  used 
to  program  a  wide  variety  of  other  important  synchronization 
primitives,  including  the  following: 

(i)    Code  which  causes  multiple  cooperating  PEs  to 
wait  until  a  required  stage  of  a  computation  is  reached 
by  other  PEs. 


(ii)  Parallel   insertion  and  deletion  of  items  from 
queues.   (We  have  programmed  this  latter  example  in  terms  of  IRAs 
in  such  a  way  that,  except  v/hen  the  queue  is  nearly  full 
or  nearly  empty,  entries  can  be  inserted  and   deleted 
concurrently  by  many  PEs .   This  allows  us  to  utilize  a 
very  large  number  of  PEs   with  high  efficiency.) 

The  IRA  operation  can  be  implemented  efficiently 
in  hardware  by  using  the  nodes  of  the  switching  network 
to  test  for  and  combine  concurrent  IRAs   to  the  same  loca- 
tion.  This  is  the  main  reason  that  we   choose  to  make  our 
communication  nodes  more  than  simple  two  by  two  switches. 

We  have  developed  a  preliminary  design  for  PE  and  MM 
arrays,  together  with  a  suitable  switching  network.   Our 
design  includes  local  caches,  which  are  used  to  diminish 
traffic  on  the  communication  network.   Preliminary  simulations 
indicate  that  a  network  of  the  sort  proposed  can  handle 
processor-to-memory  communication  effectively  enough  for  it 
to  disappear  as  an  issue  of  direct  programmer  concern. 


2 .    Aims  of  Scientific  Software  Studies 

At  the  same  time  as  our  general  hardware  and  software 
designs  evolved,  we  considered  questions  of  scientific 
programming.   In  order  to  be  able  to  transform  existing 
software  easily,  we  decided  to  use  FORTRAN,  supplemented 
with  a  few  necessary  new  constructions.   We  see  the  aims 
of  our  scientific  programming  studies  as  the  following. 

(a)  To  interact  with  hardware  and  software  design 
so  as  to  insure  a  result  that  would  permit 
straightforward  and  efficient  scientific 
programming . 

(b)  To  study  the  relationship  of  program  architecture 
to  the  structure  of  parallel  programs.  An 
important  example  is  the  use  of  high  level  versus 
low  level  parallelism. 

(c)  To  study  the  problems  of  writing  new  parallel 
software. 

(d)  To  study  the  transformation  of  existing  (sequential) 
software. 

(e)  To  study  the  problem  of  testing  and  debugging 
parallel  programs. 

(f)  To  study  the  issues  affecting  efficiency  for 
multiple  processors. 

(g)  To  parallelize   and  document  a  pot-pourri 

of  problems,  varying  from  library  routines  to  complete 
programs,   and     to  measure  their  effectiveness, 
(h)  To  interact  with  scientists  having  a  potential 
need  for  very  large  computing  power. 


The  main  tool  used  in  our  scientific  programining  effort  has 
been  a  sinulator  which  interprets  the  machine  language 
of  the  CDC  6600  plus  a  few  extra  instructions  needed  for 
parallel  synchronization  (most  notably  the  replace-add) . 
This  gives  us  a  simulated  system  consisting  of  many 
processors;  simulations   have  been  carried  out  with  as 
many  as  256   apparent  processors.  As  of  this  writing,  most 
of  our  simulations  have  not  taken  any  account  of  delays 
in  the  switch  network.   However,  a  few  simulations  have 
modeled   the  network  correctly;  these  will  be  described 
below. 


3 .    A  Spectrum  of  Program  Architectures 

Although  it  is  not  possible  to  characterize  the 
structure  of   our   parallel   scientific    programs   in 
any  simple  universal  way,  it  is   useful  to  contrast  two 
limits  for  programs  intended  for  a  parallel  computer. 
At  one  end  lies  the  simple  Monte  Carlo  program  referred 
to  above,  in  which  different  processors  act  independently 
on  independent   samples.    Such  a  program  requires  some 
initialization  and  a  final  phase  in  which  data  from  multiple 
independent  tasks  can  be  collated;  but  for  the  most  part, 
the  processes  into  which  it  can  be  divided  are  very  weakly 
coupled.   Correspondingly,  the  transformation  of  an  existing 


(sequential)   program   of  this  type  into  a  parallel  program 
requires  only  a  few   very   high   level  changes,   all  in 
an  executive  phase  of  the  program. 

The  other  limit  of  parallelization  is  typified  by 
very  large  algebraic  systems  --  systems  so  large    that 
it  is  useful  to  partition  computational  labor 

finely  to  accelerate  comoletion.   To  achieve  useful 
speedup,  such    calculations  will  have  to  be  divided  into 
parallel  processes  which  are  coupled  at  a  much  lower  level 
than  in  the  Monte  Carlo  case,  and  this  requires  closer 
examination  of  the  underlying  algorithm. 


4 .    An  account  of  our  simulated  computations . 
4 A   Monte-Carlo  Computations 

We  began  by  parallelizing  several  Monte  Carlo  programs 
The  first  was  an  elementary  quadrature  which  exemplified 
the  possibility  of  independent  sampling  in  different 
processors.   A  very  simple  random  walk  problem  drawn  from 
the  theory  of  radiation   transport  made  the  same  point 
in  a  slightly  more  sophisticated  way. 

A  third  Monte  Carlo  program  treated  cascades  (such  as 
would  arise  in  the  theory  of  cosmic  ray  showers) .  Instead 
of  keeping  processors  maximally  independent,  we  made  the 
program  more  challenging  by  entering  into  a  queue  those 
extra  tasks  arising  from  the  branching  of  an  initial 
particle.   Processors  finishing  one  branch  were  then  ■■ 


programmed  to  draw     an  entry  from  the  aueue  for  continued 
processing.   A  first  version  of  this  program,  in  which 
the  queue  was  locked  during  entry  and  deletion,  showed 
clearly   decreasing  processor  utilization  as  more  processors 
were  brought  into  the  simulation.   This  was  the  immediate 
impetus  for  the  development  of  the  improved  treatment  of 
queues  mentioned  above.   Once  improved,  ' replace-add' 
based  queues  were  used,  the  program  remained  fully  effective 
as  processors  were  added  —  that  is,  the  number  of  instruc- 
tions executed  by  any  PE  was   inversely  proportional  to 
the  total  number  of  PEs . 

Not  every  Monte  Carlo  calculation  is  quite  as  easily 
divided  into  independent  tasks   as  the  cases  just  reported. 
To  explore  the  implications    of  this,  we  choose  a  more  com- 
plex example   to  study:   simulation  of  the  equilibrium 
properties  of  a  classical  fluid  using  the  algorithm  of 
Metropolis  et  al  (2) .   In  this  method,  a  fluid  is  described 
by  giving  the  coordinates   (r^^ ,  r^  ,  .  .  .  ,  r^^)    of  a  large 
ensemble  of  particles  and  by  specifying  the  potential 
energy  V  (r-j^ ,  r2  ,  .  .  -  ,  r  )  of  the  configuration.   It  is  common 
to  treat  such  systems  by  assuming   a  "pairwise  additive 
central  potential"   i.e.   assuming  that 

V(r^, . . . ,r^)   =    I      v( |r.  -  r  .  I )  . 

i<j  ^ 

The  algorithm  then  proceeds  by  iteration  of  the  following 

sequence  of  steps  : 
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(a)  Select  a  particle  at  random.  This  is  called  a  "try". 

(b)  Move  that  particle  a  random  amount. 

(c)  Calculate  the  change  in  the  potential  which 
results  from  that  move. 

(d)  On  the  basis  of  the  change,  decide  whether  to 
accept  or  reject  the  move.  If  it  is  rejected 
the  moved  particle  is  returned  to  its  original 
position.  If  accepted,  the  moved  particle  is 
left  at  the  new  position  and  various  pointers 
are  updated. 

Steps  (a) ,  (b)  and  (d)  are  inherently  sequential   but 
rather  short.   Step  (c)  is  easy  to  split  into  many  tasks 
that  can  be  carried  out  in  parallel.   This  step  is  also  by 
far  the  most  time  consuming  part  of  the  calculation  and 
would  be  even  more  dominant  in  those  problems  that  really 
need  the  power  of  an  ultracomputer ,  namely  large  systems 
of  complex  molecules. 

Our  experimental  program  did  not  deal  with  a  physically 
complex   system  but  was  written  as  though  it  did.  Specifi- 
cally, we  treated  a  system  of  particles  on  a  ring,  interacting 

1  -J  c 

via  Lennard-Jones  potentials:  v(x)  =  4£[(x/a)   -(x/a)  ] 

where  e  and  a  are  energy  and  distance  scales.   Our  code 

used  "neighbor  tables"  which  keep  track  of   potential 

near  neighbors  of  each  particle.   This  is  a  much-used  strategy 

for  speeding  up  calculations  of  this  kind.   The  issue  for  us 

is  that  it  complicates  the  program   and  adds  to  the  amount 

of  work  carried  out  in  step  (d) . 
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It  would  probably  have  been  acceptable  to  have  absorbed 
the  penalty  of  the  three  short  sequential  stages  of 
Metropolis'   algorithm,   since   it  is   reasonable  to 
assume  that  in  a  parallel  computer  of  this  kind  we 
envision   there  would  be  other  users  whose  jobs  would  occupy 
available  processors  during  sequential   periods  of  our 
program.   However,  we  did  consider  what  could  be  done  without 
'  depending  on  the  kindness  of  strangers',   in  this  way;  i.e.  we 
did   develop   techniques   to   make   the   algorithm  as 
efficient  as  possible   when  run  alone.   This  was  accomplished 
by  staggering  even  and  odd  tries  so  that  the  even  tries 
filled  in  (with  parallel  tasks) ,  the  sequential  gap  in 
the  odd  series  and  vice-versa.   Thus,   steps  (a)  and  (b) 
of  try   1  is   first  carried   out,  leading  to  the 
creation  of  the  many  tasks  of  step  (c) .   After  these  last 
were  initiated,  the  processor  which  handles  steps  (a) 
and  (b)  continues  by  initiating  (a)  and  (b)  of  try   2, 
thereby  creating  tasks  (2C)  to  be  done  in  parallel.   The 
latter  are,    in  effect,  given  lower  priorities  than  (Ic) . 
Then  the  processor  that  completes  the  last  (Ic)  task 
carries  out  (Id)  and  initiates  (3a),  (3b)  and  (3c),  etc. 

As  with  other  problems  to  be  described  below, 
this  program  was  run  with  varying  numbers  of  physical 
particles  and  with  varying  numbers  of  simulated  processors. 
A  careful  analysis  of  the  dependence  of  running  time  on 
these  quantities  was  made  for  the  program  as  written. 
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This  gives  us  a  formula  for  extrapolating  to  determine 
the  performance  of  very  large  systems.  Analysis  can  be 
extended  to  cover  three  dimensional  problems.  We  can 
summarize  our  results  by  giving  the  efficiency,  E  =  T(1)/[PT(P) 
with  which  N  processors  can  be  used  in  a  given  problem;  here 
T(P)  is  the  time  required  when  P  processors  are  used.  For 
a  small  problem  with  a  total  of  65  neighbors  per  particle 
distributed  over  27  boxes  our  formula  predicts  that  efficiency 
will  drop  from  98%  for  16  processors  to  66%  for  512  processors. 
For  a  more  realistic  problem,  with  10000  possible  neighbors 
in  the  27  boxes,  predicated  efficiency  only  drops  to  82%  if 
16384  processors,  are  used.   These  extrapolations  confirm 
the  effective  usability   of  large  parallel  machines  for 
both  simple  and  complex  Monte  Carlo  calculations. 

4B.   Navier-Stokes  Equations 

One  of  the  next  test  problems  that  we  examined  is  derived 
from  studies  underway  at  CIMS  on  blood  flow  in  the  heart. 
This  involves  numerical  solution  of  the  Navier-Stokes 
equations  of  incompressible  flow  coupled  to  elastic 
deformation  of  a  membrane  (the  heart  muscle) .  We  wrote  a 
parallel  two  dimensional  version  of  an  existing  Navier-Stokes 
code,  in  which  the  membrane  is  simply  an  elastic  band 
represented  by  N  points   whose  motion  is  coupled  to 
that  of  other  particles  and  to  that  of  the  fluid,  which  is 
represented  by  fields  on  an  n  x  n  (periodic)  lattice. 
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The  program  was  written  and  tested  by  serial  execution 
and  was  then  made  parallel. 

A  fast  Fourier  transform  routine  was  prepared  vrhich 
could     be  made  parallel  in  an  effective  way. 
(The  transformation  to  parallelizable  structure  incurred 
a  5%  execution  time  penalty   when  run  with  1  PE. 
Hence  all  efficiencies   ouoted  later  can  aporopriately 
be  multiplied  by  0.95.   This  degradation  will  be  dependent, 
in  a  real  system,  on  details  of  hardware,  software,  and 
in  particular  upon   the  compiler.   In  our  simulator 
execution  of  a  replace-add  requires  a  subroutine  call.) 

A  problem  involving  a  16^16  mesh  and  64  points 
on  the  elastic  band  was  run  with  a  simulated  array   of  up 
to  48  processors.  The  efficiency   fell  to  50%  at  the 
largest  size  array. 

By  making  an  analysis  of  operation  counts  and 
extrapolating  to  larger  problems,  it  is  possible  to  estimate 
the  behavior  of  the  three  dimensional  PDE  computation 
which  is  in  fact  the  target  for  this  research.  The  results 
of  extrapolation  are  shown  in  Table  1.   It  is  seen  there 
that  the  efficiency  with  which  a  very  large  processor  array 
could   be  utilized   would   remain   high   for   a   large 
problem. 
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Table  1. 
Estimated  Efficiency  for  Ultracomputer  Solution 
of  the  Navier  Stokes  Eauation 


#PE 

2D; 

N=2  56; 

n  =  64 

4 

1.00 

32 

0.97 

128 

0.65 

512 

0.20 

4096 

8192 

2,  768 

3D;  N=25  722;  n=64 
1.00 
1.00 
1.00 
0.99 
0.91 
0.69 
0.35 
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6.    Reduction   of  ^  Symmetric  Matrix  to  Tridiagonal  Form 

In  this  exercise  we  transformed  an  existing  algebraic 
utility  (TRED2  of  EISPACK,  an  implementation  of  Householder's 
method)  so  that  it  could  be  used  effectively  in  parallel. 
Table  2  shows  measured  efficiencies  of  reduction  for 
a  32>=  32  matrix  treated  by  varying  numbers  of  processors 
and  N^  N  matrices  treated  by  32  processors. 


Table  2 


Measured  Efficiency  for  Tridiagonal  Matrix  Peduction 


32  X  32  Matrix 

#PE  E 

2  0.99 

4  0.97 

16  0.86 

18  0.66 


N  X  N 

matrix , 

32 

processors 

N 

E 

16 

0.43 

24 

0.63 

48 

0.87 

67 

0.88 
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Again  we  see  that  a  large  problem  can  be  effectively 
treated  using  large  niimbers  of  processors.  This  point 
is  made  more  vivid   by   listing   extrapolated   efficiencies 
derived  from  a  closer  timing   analysis.   These  are  shown 
in  Table  3. 


Table  3 
Estimated  Efficiency  for  Tridiagonal  Matrix  Reduction 


256 

PES 

N 

E 

32 

0.26 

128 

0.74 

512 

0.92 

1024 

0.96 

2048 

PES 

N 

E 

32 

0.04 

128 

0.27 

512 

0.59 

1024 

0.74 
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7.   An  Atmospheric  Modelling  Program. 

A  program  that  models  flow  in  a  vertical  plane  of  the 
atmosphere  was  adapted  to  run  under  our  simulator.  Since   this 
was   the    most  extensive  body  of  existing  software  that 
we   ^«-'orked  with,  it  is  worth  noting  the  steps  taken  to 
make  it  parallel. 

(a)  The  number  of  mesh  points  used  in  the  original  program 
was  reduced  to  fit  the  memory  limitations  imposed  by 
the  simulator  and  the  CDC  6600.   A  serial  version  was 
then  prepared  and  run. 

(b)  A  fully  parallel  code  was  prepared  by  adding  a 
rather  high  level  synchronization.   This  entailed 
18%  loss  of  efficiency  for  a  single  PE. 

(c)  The  parallel  version  was  run  and,  using  the 
monitoring  capability  provided  by  the  simulator, 
the  following  were  m.easured: 

Total  number  of  operations 
Time  lost  waiting  for  synchronization 
Public  vs  private  memory  references 
Number  of  replace-adds 

The  last  two   of   these   items   impact      the 
design  of  the  communications  switch/ while  the  second 
sheds  light  on  the  efficiency  losses  incurred  by  modifying 
the  code  to  make  it  parallel. 


The  algorithm  in  question  used  a  "leapfrog"  integration  in 
time:  at  each  step   an  array  of  new  values  of  field  quantities 
is  computed  from  old  values.  Thus  the  orier  of  conputation 
of  the  new  values  is  irrelevant  and  each  mesh  point  update 
can  be  handled  by  a  separate  processor.  Synchronization 
need  be  imposed  only  at  the  end  of  a  mesh  sweep. 
Because  the  amount  of  arithmetic  associated  with  each 
update  is  not  large,  this  method  is  not  effective  for  small  meshes 
on  the  large  processor  arrays.  But  it  is  effective  for  large 
meshes.  Moreover,  for  a  large  (and  hence  interesting)  mesh 
it  is  possible  (an-^  desirable)  to  assign  several  mesh  points 
to  a  PE  to  attain  still  greater  efficiency. 

Table  4  gives  measured  efficiencies  for  several  meshes.  ' 

Table  4 
Measured  Efficiency  of  the  Two-Dimensional  Atmospheric  Flow 

Model 


MESH 


#    PE 

5  <   16 

6X16 

5^48 

1 

1.00 

1  .00 

1.00 

4 

0.99 

0.99 

1.00 

16 

0.96 

0.96 

0.98                        1 

32 

0.78 

0.89 

0.91 

40 

0.84 

0.74 

0.94 

48 

0.73 

0.83 

0.96 
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It  is  interesting  that  the  efficiencies  are  not  monotone 
in  the  number  of  processors.  For  example,  we  see  a  dip  at 
32  processors  for  the  5 x  16  mesh.   The  reason  for  this  is  as 
follows.   Since  there  are  80  mesh  points,  all  of  the  32 
processors  are  used  to  deal  with  the  first  64  mesh  points, 
but  then  16  of  the  32  are  inactive  waiting  for  the  remaining 
16  mesh  points  to  be  finished  by  16  processors.  When  40  PEs 
are  assigned  to  the  5  x  16  mesh,  they  will  simply  be  used 
twice  over.   Of  course,  all  these  minor  irregularities  will 
be  small  when  there  are  many  more  mesh  points  than  processors. 

The  timing  for  larger  systems  can  be  extrapolated  as 
shown  in  Table  5. 

Table  5 
Projected  Efficiency  for  the  Two-Dimensional 
Atmospheric  Flow  Model 


#PE 

150x5 

350x5 

550x5 

750x5 



2000x5 

200x20 

100 

0.984 

0.993 

0.995 

0.997 

1.000 

1.000 

200 

0.929 

0.968 

0.979 

0.985 

0.999 

0.999 

400 

0.75 

0.94 

0.92 

0.97 

1.00 

1.00 

1000 

0.37 

0.58 

0.68 

0.75 

1.00 

1.00 

4000 

0.13 

0.13 

0.13 

0.13 

0.43 

0.98 



20 


8.    Summary 

We  have  found  no  serious  problems  in  writing  or  adapting 
scientific  programs  in  FORTRAN  for  the  ultracomputer  as  it  is 
presently  envisioned.  Every  problem  treated  has  confirmed  our 
expectation  that  very  large  problems  can  be  dealt  with  effi- 
ciently by  large  processor  arrays.  Of  course,  small  problems 
are  best  left  to  one  or  a  few  PEs.   In  designing  or  transform- 
ing parallel  software   attention  must  be  given  to  the  proper 
balance  between  local  and  distributed  processing.  For  this 
it  may  become  necessary  to  provide  flexible  general  purpose 
algorithms.   This  issue  will  be  less  critical  when  problems 
are  run  under  a  multitasking  operating  system  with  additional 
jobs  available  to  use  the  remaining  resources. 
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